
[WIP] Support DeepSeek V4 flash on SM120 with Triton fallback #40929

Open

bbbearxyz wants to merge 25 commits into vllm-project:main from bbbearxyz:support_sm120_deepseekv4

Conversation


@bbbearxyz bbbearxyz commented Apr 26, 2026

Issue: #40928
This PR is based on #40760
Tested on 2 x RTX Pro 6000 (SM120)

Summary

Support Triton fallback ops for DeepSeek V4 flash when DeepGEMM or FlashMLA is not available.

This PR adds a generic Triton implementation path for the DeepSeek V4 branch, including fallback kernels for sparse MLA attention, decode sparse attention, FP8 einsum, sparse attention indexer logits, and MHC prenorm GEMM. The existing optimized DeepGEMM / FlashMLA paths are still preferred when available; the Triton path is only used as a fallback.
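
For readers skimming the diff, the selection behavior described here amounts to a capability probe followed by preference-ordered dispatch. A minimal sketch of that shape (the helper and module names below are illustrative, not this PR's actual API):

```python
# Hypothetical availability probes; vLLM's real checks live in its own
# utility modules and may use different names/signatures.
def deepgemm_available() -> bool:
    try:
        import deep_gemm  # noqa: F401  (DeepGEMM's Python package)
        return True
    except ImportError:
        return False


def flashmla_available() -> bool:
    try:
        import flash_mla  # noqa: F401  (module name illustrative)
        return True
    except ImportError:
        return False


def select_sparse_mla_backend() -> str:
    # Preference order described in this PR: optimized kernels first,
    # Triton only as the portable fallback.
    if deepgemm_available():
        return "deepgemm"
    if flashmla_available():
        return "flashmla"
    return "triton"
```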

Why

My approach for running DeepSeek V4 flash on SM120 is to provide a generic Triton implementation instead of hard-blocking execution on DeepGEMM or FlashMLA availability.

I think this is a reasonable fit for the vLLM DeepSeek V4 branch: when FlashMLA or DeepGEMM does not support a device yet, vLLM should still have a portable implementation that lets users run the model. Triton gives us a more general compatibility layer across GPU architectures, including SM120 and future SM architectures.

The goal of this PR is not to replace the optimized kernels. DeepGEMM and FlashMLA should remain the preferred paths when they are supported. However, when they are unavailable, the Triton fallback gives users a working implementation, even if there is still room for performance optimization.

This also keeps the migration cost low. If DeepGEMM adds SM120 support in the future, vLLM can switch SM120 back to the DeepGEMM path with minimal changes, while still keeping Triton as a portable fallback for other unsupported architectures.

Changes

This PR supports DeepSeek V4 flash on SM120 by adding a generic Triton fallback path for kernels that currently depend on DeepGEMM or FlashMLA.

Main changes include:

  • Add Triton fallback kernels for DeepSeek V4 sparse MLA attention and decode sparse attention.
  • Add a Triton fallback implementation for the DeepSeek V4 FP8 einsum path.
  • Add Triton fallback kernels for sparse attention indexer logits.
  • Add a Triton fallback path for MHC prenorm GEMM (a toy Triton GEMM in this fallback style is sketched after this list).
  • Keep DeepGEMM / FlashMLA as the preferred optimized paths when available.
  • Fall back to Triton automatically when DeepGEMM or FlashMLA is unavailable, enabling DeepSeek V4 to run on SM120 and on future SM architectures that the optimized kernels do not yet support.
  • Keep the implementation compatible with future migration to DeepGEMM once SM120 support becomes available.
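
To illustrate the shape of these fallbacks, here is a toy blocked GEMM in Triton, the kind of kernel a prenorm-GEMM fallback builds on. This is a sketch for orientation only, not one of the PR's kernels:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _gemm_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                 stride_am, stride_ak,
                 stride_bk, stride_bn,
                 stride_cm, stride_cn,
                 BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                 BLOCK_K: tl.constexpr):
    # One program computes one BLOCK_M x BLOCK_N tile of C = A @ B.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs,
                    mask=(offs_m[:, None] < M) & (offs_k[None, :] + k < K),
                    other=0.0)
        b = tl.load(b_ptrs,
                    mask=(offs_k[:, None] + k < K) & (offs_n[None, :] < N),
                    other=0.0)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc,
             mask=(offs_m[:, None] < M) & (offs_n[None, :] < N))


def triton_gemm(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # fp16 inputs with fp32 accumulation/output, as is typical for
    # portable fallback GEMMs.
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
    _gemm_kernel[grid](a, b, c, M, N, K,
                       a.stride(0), a.stride(1),
                       b.stride(0), b.stride(1),
                       c.stride(0), c.stride(1),
                       BLOCK_M=64, BLOCK_N=64, BLOCK_K=32)
    return c
```

Because the kernel is written against generic block pointers and strides rather than architecture-specific intrinsics, the same code compiles on SM120 and other Triton-supported GPUs, which is the portability argument above.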

Serving benchmark

random input len: 1024
random output len: 1024
num prompts: 32
max_model_len=8192
gpu_memory_utilization=0.9

TP=2, PP=1

| Max concurrency | Duration (h) | Throughput (tok/s) | Output throughput (tok/s) |
|-----------------|--------------|--------------------|---------------------------|
| 1               | 0.1736       | 104.89             | 52.44                     |
| 4               | 0.0548       | 332.01             | 166.00                    |
| 8               | 0.0329       | 553.16             | 276.58                    |
| 16              | 0.0227       | 800.37             | 400.19                    |
| 32              | 0.0169       | 1076.35            | 538.17                    |
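
For reference, an offline engine configured the same way as this serving benchmark might look as follows (the model id is a placeholder; the engine arguments mirror the configuration above):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V4-Flash",  # hypothetical repo id
    tensor_parallel_size=2,       # TP=2
    pipeline_parallel_size=1,     # PP=1
    max_model_len=8192,
    gpu_memory_utilization=0.9,
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=64)))
```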

zyongye and others added 22 commits April 25, 2026 20:01

mergify Bot commented Apr 26, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @bbbearxyz.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Apr 26, 2026
gemini-code-assist Bot left a comment

Code Review

This pull request introduces support for the DeepSeek V4 model architecture, featuring horizontally-fused MLA kernels, specialized MoE gating with softplus_sqrt, and MTP draft model integration. The changes include new CUDA and Triton kernels for optimized attention, quantization, and normalization, along with updates to Docker configurations and external dependencies like DeepGEMM and FlashMLA. Technical feedback identifies critical issues regarding the initialization of E8M0 scales, insufficient hardware capability guards for FP8 intrinsics in CUDA kernels which require SM89+, and a potential tensor reshape error in the Triton fallback logic.

Comment threads:
  • vllm/model_executor/layers/quantization/utils/fp8_utils.py
  • csrc/fused_deepseek_v4_qnorm_rope_kv_insert_kernel.cu (2 threads)
  • vllm/model_executor/layers/deepseek_v4_triton_kernels.py
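
On the SM89+ point raised above: at the Python level such a guard typically reduces to a compute-capability check. A minimal sketch, independent of this PR's actual code:

```python
import torch

def fp8_hw_supported() -> bool:
    # NVIDIA FP8 (e4m3/e5m2) intrinsics require SM89 (Ada) or newer;
    # get_device_capability() returns a (major, minor) tuple.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)
```
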
@bbbearxyz bbbearxyz force-pushed the support_sm120_deepseekv4 branch from aab48af to c1418ec on April 26, 2026 17:49
@bbbearxyz bbbearxyz requested a review from bigPYJ1151 as a code owner April 26, 2026 17:49
@mergify mergify Bot added the frontend and cpu (Related to CPU backends) labels and removed the needs-rebase label Apr 26, 2026
@bbbearxyz bbbearxyz changed the title Support DeepSeek V4 on SM120 with Triton fallback [WIP] Support DeepSeek V4 on SM120 with Triton fallback Apr 26, 2026
@bbbearxyz bbbearxyz force-pushed the support_sm120_deepseekv4 branch 2 times, most recently from 943bd81 to 9fca2b9 on April 26, 2026 18:14
@bbbearxyz bbbearxyz force-pushed the support_sm120_deepseekv4 branch from 9fca2b9 to b2a9e98 on April 26, 2026 18:16
@bbbearxyz bbbearxyz force-pushed the support_sm120_deepseekv4 branch from 0a93f94 to d521d3e on April 26, 2026 18:30
@bbbearxyz bbbearxyz changed the title [WIP] Support DeepSeek V4 on SM120 with Triton fallback [WIP] Support DeepSeek V4 flash on SM120 with Triton fallback Apr 26, 2026

myshytf commented Apr 27, 2026

very nice.

jasl pushed a commit to jasl/vllm that referenced this pull request Apr 28, 2026
Cherry-picked from vllm-project#40929 commit b2a9e98.

aqua001 pushed a commit to aqua001/vllm that referenced this pull request May 2, 2026
Cherry-picked from vllm-project#40929 commit b2a9e98.

ehfd commented May 5, 2026

Please reference #38476 as well. Consideration for SM80/86/89 would also be appreciated, since they can use Triton as well.


Labels

ci/build, cpu, deepseek, documentation, frontend, gpt-oss, kv-connector, new-model, nvidia, performance, speculative-decoding, tool-calling, v1

Projects

Status: No status
Status: To Triage

Development

Successfully merging this pull request may close these issues.

9 participants